Combining building blocks for parallel multi-level matrix multiplication

Authors

  • Sascha Hunold
  • Thomas Rauber
  • Gudula Rünger
Abstract

EXTENDED ABSTRACT

Matrix-matrix multiplication is one of the core computations in many algorithms from scientific computing and numerical analysis, and many efficient realizations have been invented over the years, including many parallel ones. The current trend toward clusters of PCs or SMPs for scientific computing suggests revisiting matrix-matrix multiplication and investigating the efficiency and scalability of different versions on clusters. In this talk we present parallel algorithms for matrix-matrix multiplication that are built up from several algorithms in a multilevel structure.

On a single processor, ATLAS or PHiPAC create very efficient implementations by adjusting the computation order to the specific memory hierarchy and by exploiting the functional parallelism of the processor (SSE2). Parallel approaches include many decomposition-based methods such as Cannon's algorithm or the algorithm of Fox; efficient implementation variants of the latter are SUMMA and PUMMA. Matrix-matrix multiplication by Strassen or Strassen-Winograd benefits from a reduced number of operations but requires a special schedule for a parallel implementation.

In the context of clusters of SMPs, mixed programming models such as mixed task and data parallelism are important, since efficiency and scalability can be improved by using multiprocessor tasks (M-tasks). Task-parallel and mixed implementations of matrix-matrix multiplication have already been proposed in the literature. One possibility for parallelizing Strassen's algorithm is to distribute the seven intermediate results onto a group of processors of size 7^i, preferably in a ring or torus configuration. Other approaches mix the common Fox BMR (broadcast-multiply-roll) method with Strassen's algorithm. Another mixed parallel ...
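To make the Strassen scheme mentioned above concrete, the sketch below shows one level of Strassen's recursion in NumPy: the seven sub-products M1..M7 replace the eight products of the naive blocked scheme, and each is independent, which is why they can be distributed onto seven processor groups as the abstract describes. This is a generic single-node sketch, not the authors' implementation; it assumes square matrices of even dimension.

```python
import numpy as np

def strassen_once(A, B):
    """One level of Strassen's recursion: 7 quadrant products instead of 8.

    In a parallel setting, each of M1..M7 could be assigned to a separate
    processor group, since they are mutually independent.
    """
    n = A.shape[0]
    h = n // 2
    A11, A12 = A[:h, :h], A[:h, h:]
    A21, A22 = A[h:, :h], A[h:, h:]
    B11, B12 = B[:h, :h], B[:h, h:]
    B21, B22 = B[h:, :h], B[h:, h:]

    # The seven Strassen products (independently computable)
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    # Recombine the products into the four quadrants of C = A * B
    C = np.empty((n, n), dtype=A.dtype)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

Applying this splitting recursively i times yields 7^i independent products, matching the processor-group size 7^i mentioned in the abstract.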


Related articles

Data-parallel programming with Intel Array Building Blocks (ArBB)

Intel Array Building Blocks is a high-level data-parallel programming environment designed to produce scalable and portable results on existing and upcoming multi- and many-core platforms. We have chosen several mathematical kernels (a dense matrix-matrix multiplication, a sparse matrix-vector multiplication, a 1-D complex FFT, and a conjugate gradient solver) as synthetic benchmarks and representa...

A New Parallel Matrix Multiplication Method Adapted on Fibonacci Hypercube Structure

The objective of this study was to develop a new optimal parallel algorithm for matrix multiplication that can run on a Fibonacci hypercube structure. Most popular algorithms for parallel matrix multiplication cannot run on a Fibonacci hypercube structure; a method that runs on all structures, and on the Fibonacci hypercube in particular, is therefore necessary for parallel matr...

Fast recursive matrix multiplication for multi-core architectures

In this article, we present a fast algorithm for matrix multiplication optimized for recent multi-core architectures. The implementation exploits several methodologies from parallel programming, such as recursive decomposition, efficient low-level implementations of basic blocks, software prefetching, and task scheduling, resulting in a multilevel algorithm with adaptive features. Measurements on ...

Fast Finite Element Method Using Multi-Step Mesh Process

This paper introduces a new method for accelerating the currently sluggish FEM and improving its memory demand in problems with high node resolution or bulky structures. Like most numerical methods, FEM results in a matrix equation that normally has huge dimensions. By breaking the main matrix equation into several smaller matrices, the solving procedure can be accelerated. For implementing ...

Chunks and Tasks: A programming model for parallelization of dynamic algorithms

We propose Chunks and Tasks, a parallel programming model built on abstractions for both data and work. The application programmer specifies how data and work can be split into smaller pieces, chunks and tasks, respectively. The Chunks and Tasks library maps the chunks and tasks to physical resources. In this way we seek to combine user friendliness with high performance. An application program...


Journal:
  • Parallel Computing

Volume 34, Issue -

Pages -

Published 2008